Nearly every aspect of soil survey involves the question: "Is X more similar to Y or to Z?" The quantification of similarity within a collection of horizons, pedons, components, map units, or even landscapes represents an exciting new way to enhance the precision and accuracy of the day-to-day work of soil scientists. After completing this module, you should be able to quantitatively organize objects based on measured or observed characteristics in a consistent and repeatable manner. Perhaps you will find a solution to the long-standing "similar or dissimilar" question.
Most of the examples featured in this whirlwind tour are based on soil data from McGahan, D.G., Southard, R.J., and Claassen, V.P. 2009. Plant-available calcium varies widely in soils on serpentinite landscapes. Soil Sci. Soc. Am. J. 73:2087-2095. These data are available in the dataset "sp4" that is built into the aqp package for R.
There are shelves of books and thousands of academic articles describing the theory and applications of "clustering" and "ordination" methods. This body of knowledge is commonly described as the field of numerical taxonomy (Sneath and Sokal 1973). Central to this field is the quantification of similarity among "individuals" based on a relevant set of "characteristics." Individuals are typically described as rows of data with a single characteristic per column, together referred to as a data matrix. For example:
| name | clay | sand | Mg | Ca | CEC_7 |
|---|---|---|---|---|---|
| A | 21 | 46 | 25.7 | 9.0 | 23.0 |
| ABt | 27 | 42 | 23.7 | 5.6 | 21.4 |
| Bt1 | 32 | 40 | 23.2 | 1.9 | 23.7 |
| Bt2 | 55 | 27 | 44.3 | 0.3 | 43.0 |
Quantitative measures of similarity are more conveniently expressed as distance, or dissimilarity; in part because of convention and in part because of computational efficiency. In the simplest case, dissimilarity can be computed as the shortest distance between individuals in property-space. Another name for the shortest linear distance between points is the Euclidean distance. Evaluated in two dimensions (between individuals \(p\) and \(q\)), the Euclidean distance is calculated as follows:
\[D(p,q) = \sqrt{(p_{1} - q_{1})^{2} + (p_{2} - q_{2})^{2}}\]
where \(p_{1}\) is the 1st characteristic (or dimension) of individual \(p\).
There are many other ways to define "distance" (i.e., other distance metrics); several of these are covered later.
Using the sand and clay percentages from the data above, dissimilarity is represented as the length of the line connecting any two individuals in property space.
The following is a matrix of all pair-wise distances (the distance matrix):
| | A | ABt | Bt1 | Bt2 |
|---|---|---|---|---|
| A | 0.0 | 7.2 | 12.5 | 38.9 |
| ABt | 7.2 | 0.0 | 5.4 | 31.8 |
| Bt1 | 12.5 | 5.4 | 0.0 | 26.4 |
| Bt2 | 38.9 | 31.8 | 26.4 | 0.0 |
Note that this is the full form of the distance matrix. In this form, zeros are on the diagonal (i.e., the distance between an individual and itself is zero) and the upper and lower "triangles" are symmetric. Most algorithms store only the lower triangle to encode pair-wise distances.
| | A | ABt | Bt1 |
|---|---|---|---|
| ABt | 7.2 | | |
| Bt1 | 12.5 | 5.4 | |
| Bt2 | 38.9 | 31.8 | 26.4 |
Interpretation of the matrix is simple: Individual "A" is more like "ABt" than like "Bt1." It is important to note that quantification of dissimilarity (distance) among individuals is always relative: "X is more like Y, as compared to Z."
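The distance matrices above can be reproduced directly in base R. A minimal sketch using the sand and clay percentages from the data matrix: `dist()` stores only the lower triangle, and `as.matrix()` expands it into the full, symmetric form.

```r
# data matrix from the example above, sand and clay only
x <- data.frame(
  clay = c(21, 27, 32, 55),
  sand = c(46, 42, 40, 27),
  row.names = c('A', 'ABt', 'Bt1', 'Bt2')
)

# pair-wise Euclidean distances; a 'dist' object stores the lower triangle
d <- dist(x, method = 'euclidean')
round(d, 1)

# expand to the full, symmetric form with zeros on the diagonal
round(as.matrix(d), 1)
```

Note that `round(d, 1)` reproduces the lower-triangle table above (e.g., A vs. ABt is 7.2).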
Euclidean distance doesn't make much sense if the characteristics do not share a common unit of measure or range of values. Nor is it relevant when some characteristics are categorical and some are continuous. For example, distances are distorted if you compare clay (%) and exchangeable Ca (cmol/kg).
In this example, exchangeable Ca contributes less to the distance between individuals than clay content, effectively down-weighting the importance of the exchangeable Ca. Typically, characteristics are given equal weight (Sneath and Sokal 1973); however, weighting is much simpler to apply after standardization.
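A quick numeric check of this distortion, using horizons A and Bt2 from the data matrix above: because squared differences are summed, the characteristic with the larger numeric range dominates the result.

```r
# A vs. Bt2, using original (unstandardized) units
clay.sq <- (21 - 55)^2    # squared clay difference: 1156
ca.sq <- (9.0 - 0.3)^2    # squared exchangeable Ca difference: ~75.7

# clay contributes roughly 15x more to the squared distance,
# purely because of its larger numeric range
round(clay.sq / ca.sq, 1)
```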
Standardization of the data matrix solves the problem of unequal ranges or units of measure, typically by subtraction of the mean and division by standard deviation (z-score transformation).
\[x_{std} = \frac{x - mean(x)}{sd(x)}\]
There are several other standardization methods covered later. The new data matrix looks like the following:
## name clay sand Mg Ca CEC_7
## 1 A -0.86 0.88 -0.35 1.23 -0.47
## 2 ABt -0.45 0.40 -0.55 0.36 -0.63
## 3 Bt1 -0.12 0.15 -0.60 -0.59 -0.40
## 4 Bt2 1.43 -1.43 1.49 -1.00 1.49
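The standardized matrix above can be reproduced with base R's `scale()`, which implements the z-score transformation column by column.

```r
# original data matrix
x <- data.frame(
  clay = c(21, 27, 32, 55),
  sand = c(46, 42, 40, 27),
  Mg = c(25.7, 23.7, 23.2, 44.3),
  Ca = c(9.0, 5.6, 1.9, 0.3),
  CEC_7 = c(23.0, 21.4, 23.7, 43.0),
  row.names = c('A', 'ABt', 'Bt1', 'Bt2')
)

# z-score transformation: subtract column mean, divide by column SD
x.std <- scale(x, center = TRUE, scale = TRUE)
round(x.std, 2)
```

After standardization, each column has a mean of 0 and a standard deviation of 1, so all characteristics contribute on a comparable scale.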
Using the standardized data matrix, distances computed in the property space of clay and exchangeable calcium are unbiased by the unique central tendency or spread of each character.
Rarely can the question of "dissimilarity" be answered with only two characteristics (dimensions). Euclidean distance, however, can be extended to an arbitrary number of \(n\) dimensions.
\[D(p,q) = \sqrt{ \sum_{i=1}^{n}{(p_{i} - q_{i})^{2}} }\]
In the equation above, \(i\) is one of \(n\) total characteristics. Imagining what distance "looks like" is difficult if there are more than three dimensions. Instead, examine the distance matrix calculated using all five characteristics.
Distances are often rescaled to the interval [0,1], for example by dividing by the maximum pair-wise distance, so that values can be interpreted as relative dissimilarity.
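A sketch of both ideas in base R: Euclidean distance computed across all five (standardized) characteristics, then rescaled to [0,1] by dividing by the largest pair-wise distance. Note that `scale()` standardizes by standard deviation, so values will differ slightly from `daisy(..., stand = TRUE)`, which standardizes by mean absolute deviation.

```r
# data matrix, all 5 characteristics
x <- data.frame(
  clay = c(21, 27, 32, 55),
  sand = c(46, 42, 40, 27),
  Mg = c(25.7, 23.7, 23.2, 44.3),
  Ca = c(9.0, 5.6, 1.9, 0.3),
  CEC_7 = c(23.0, 21.4, 23.7, 43.0),
  row.names = c('A', 'ABt', 'Bt1', 'Bt2')
)

# Euclidean distance in 5 dimensions, after standardization
d <- dist(scale(x))

# rescale to the interval [0,1]: divide by the largest pair-wise distance
d.01 <- d / max(d)
round(d.01, 2)
```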
You can now begin to describe dissimilarity between individuals using an arbitrary number of (relevant) characteristics. You can make statements like "The A horizon is roughly 2x more similar to the ABt horizon than it is to the Bt horizon." Although this is a trivial example, the utility of generalizing these methods to soil survey operations should be obvious.
Missing data are a fact of life. Soil scientists are quite familiar with missing lab data ("Why didn't we request optical grain counts?") or missing essential NASIS pedon data elements, such as horizon bottom depth, estimated clay fraction, or pH. Nearly all of the methods described in this document are very sensitive to missing data. In other words, they won't work! Following are a couple of possible solutions:
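One of the simplest work-arounds (a sketch of one option, not the only one) is listwise deletion: keep only those individuals with a complete set of characteristics, via base R's `complete.cases()`. Imputation of missing values is another possibility, not shown here.

```r
# toy data with a missing Ca value
x <- data.frame(
  clay = c(21, 27, 32),
  Ca = c(9.0, NA, 1.9),
  row.names = c('A', 'ABt', 'Bt1')
)

# retain only those rows (individuals) with no missing values
x.complete <- x[complete.cases(x), ]
nrow(x.complete)  # 2 individuals remain
```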
Dendrograms are a convenient approximation of pair-wise distances between individuals (after application of hierarchical grouping criteria; more on this later). Dissimilarity between branches is proportional to the level at which branches merge: branching at higher levels (relative to the root of the tree) suggests greater dissimilarity; branching at lower levels suggests greater similarity. Consider the previous example in which distance between individuals was defined in terms of sand and clay percentages.
Interpretation is simple. Euclidean distance in property-space is directly proportional to branching height in the corresponding dendrogram. Visualizing the geometry of pair-wise distances in more than three dimensions is difficult. A dendrogram, however, can conveniently summarize a distance matrix created from an arbitrary number of characteristics. It is important to note that some information about pair-wise distances is lost or distorted in the dendrogram. Distortion is least near the terminal "leaves" of the dendrogram. This phenomenon is analogous to the distortion generated by a map projection. It is impossible to flatten a higher-dimensional entity to a lower-dimensional form without causing distortion.
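A base R sketch of this idea, using the sand and clay percentages from the earlier data matrix: the merge heights stored in an `hclust` result are exactly the branching heights drawn in the dendrogram.

```r
# sand and clay percentages from the earlier example
x <- data.frame(
  clay = c(21, 27, 32, 55),
  sand = c(46, 42, 40, 27),
  row.names = c('A', 'ABt', 'Bt1', 'Bt2')
)

# agglomerative clustering (default linkage is 'complete')
h <- hclust(dist(x))

# heights at which branches merge: low = similar, high = dissimilar
round(h$height, 1)

# basic dendrogram
plot(h, main = 'Dendrogram: sand + clay')
```

The smallest merge height (5.4) corresponds to the smallest pair-wise distance in the matrix (ABt vs. Bt1).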
There isn't much difference between these two figures because most of the characteristics in the example dataset are highly correlated with soil texture. There are some more important details on how individuals are connected into larger and larger groups within the dendrogram.
Cluster analysis is a massive topic that deals with the seemingly simple task of finding useful groups within a dataset. This topic and the methods used are also referred to as "unsupervised classification" in the fields of remote sensing and GIS. All of the available algorithms will find groups in a given dataset; however, it is up to the subject expert to determine the following:
Note that the widespread use of color in the following examples is not for aesthetic purposes. Colors are convenient for tri-variate data-spaces because you can visually integrate the information into a self-consistent set of classes.
Hierarchical clustering is useful when a full distance matrix is available and the optimal number of clusters is not yet known. This form of clustering creates a data structure that can encode "grouping" information from one cluster to as many clusters as there are individuals. The expert must determine the optimal place to "cut the tree" and generate a fixed set of clusters. The results from a hierarchical clustering operation are nearly always presented as a dendrogram.
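"Cutting the tree" is done with `cutree()` in base R; a minimal sketch using the sand/clay data from earlier:

```r
# sand and clay percentages from the earlier example
x <- data.frame(
  clay = c(21, 27, 32, 55),
  sand = c(46, 42, 40, 27),
  row.names = c('A', 'ABt', 'Bt1', 'Bt2')
)

h <- hclust(dist(x))

# "cut the tree" into a fixed number of clusters (here, 2)
groups <- cutree(h, k = 2)
groups
```

With these data, the very dissimilar Bt2 horizon is placed in its own cluster.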
There are two main types of hierarchical clustering: agglomerative (individuals are iteratively merged into larger and larger groups) and divisive (the complete set is iteratively split into smaller and smaller groups).
Both methods are strongly influenced by the choice of standardization method and distance metric. Both methods require a full, pair-wise distance matrix as input. This requirement can limit the use of hierarchical clustering to datasets that can be fit into memory.
The agglomerative methods also depend on the choice of linkage criterion. Some of these criteria include:
- single linkage (nearest neighbor)
- complete linkage (furthest neighbor)
- average linkage (UPGMA)
- Ward's minimum-variance method
- the flexible strategy, implemented in agnes() and recommended by Kaufman and Rousseeuw (2005), tuned via the parameter alpha

See Kaufman and Rousseeuw (2005), Arkley (1976), and Legendre and Legendre (1998) for a detailed description of these linkage criteria.
Centroid and medoid cluster analyses are commonly referred to as k-means-style analysis. "K-means," however, is just one of many possible clustering algorithms that partition property-space into a fixed number of groups. These types of algorithms can be applied to very large datasets because they do not rely on a distance matrix. Instead, they are based on an iterative shuffling of group "centroids" until some criterion is minimized, for example, the mean variance within groups.
This section describes three (out of many) of the most important partitioning-type algorithms.
All of these methods are sensitive to the type of standardization applied to the characteristics. These methods rely on iterative minimization of one or more criteria; therefore, each clustering "run" may generate slightly different output. Most implementations re-run the algorithm until it stabilizes.
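A minimal sketch of this behavior with base R's `kmeans()`: the `nstart` argument re-runs the algorithm from several random starting configurations and keeps the best result (smallest total within-cluster sum of squares), stabilizing the output.

```r
set.seed(10101)

# simulate two groups of points in a 2D property-space
x <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2)
)

# 10 random starts; the best solution is returned
cl <- kmeans(x, centers = 2, nstart = 10)

# cluster sizes
table(cl$cluster)
```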
Humans are generally quite good at extracting spatial patterns, almost instantly, from two-dimensional fields: faces, written language, etc. Sadly, this ability does not extend beyond two or three dimensions. The term ordination refers to a suite of methods that project coordinates in a high-dimensional space into suitable coordinates in a low-dimensional (reduced) space. Map projections are a simple form of ordination: coordinates from the curved surface of the Earth are projected to a two-dimensional plane. As with any projection, there are assumptions, limitations, and distortions. Carl Sagan gives a beautiful demonstration of this concept using shadows, in this excerpt from Cosmos.
Principal component analysis (PCA) is one of the simplest and most widely used ordination methods. The reduced space (the "principal components") is defined by linear combinations of the original characteristics. The rest of this chapter is focused on multidimensional scaling (MDS). Non-metric multidimensional scaling (nMDS) attempts to generate a reduced space that minimizes distortion in proportional similarity; i.e., similar individuals are near each other in the reduced space, and dissimilar individuals are farther apart. The "non-metric" adjective implies that exact distances are not preserved.
Hole and Hironaka (1960) were some of the first pedologists to apply ordination methods to soils data. The main figure from their classic paper was hand-drawn, based on a physical model (constructed from wooden dowels and spheres!) of the ordination.

The following example is based on a data matrix containing lab measurements of clay fraction, sand fraction, exchangeable Ca, exchangeable Mg, and CEC measured by NH4-Ac at pH 7.
| name | clay | sand | Mg | Ca | CEC_7 |
|---|---|---|---|---|---|
| A | -0.41 | 0.21 | 0.06 | 0.44 | -0.23 |
| ABt | 0.04 | -0.07 | -0.06 | -0.13 | -0.38 |
| Bt1 | 0.41 | -0.21 | -0.09 | -0.74 | -0.16 |
| ... | ... | ... | ... | ... | ... |
Because this is a complex topic, it is described in a supplemental set of slides.
For more detailed information, see this relevant paper.
This is the fun part.
Install R packages as needed. Open a new R script file to use as you follow along.
# load libraries
library(aqp)
library(soilDB)
library(sharpshootR)
library(cluster)
library(ape)
library(RColorBrewer)
library(vegan)
library(MASS)
library(colorspace)
Most of the examples used in the following exercises come from these sources:
- Built-in datasets from the aqp and soilDB packages ("sp4," "gopheridge," and "loafercreek")
- fetchNASIS(): pedon data from the local NASIS selected set
- fetchKSSL(): lab characterization data from the SoilWeb snapshot
- fetchOSD(): basic morphologic and taxonomic data from the SoilWeb snapshot
- SDA_query(): live SSURGO spatial and tabular data from Soil Data Access

In most cases, you can edit the examples and swap-in just about any data that are in a SoilProfileCollection object. For example, pedons from your local NASIS selected set can be loaded with fetchNASIS().
Tinker with some SoilProfileCollection objects.
See the relevant manual pages (e.g., ?fetchKSSL) or the SoilProfileCollection tutorial for details.

The aqp package provides two functions for checking the fraction of missing data within a SoilProfileCollection object. The first function (evalMissingData) generates an index that ranges from 0 (all missing) to 1 (all present) for each profile. This index can be used to subset or rank profiles for further investigation. The second function (missingDataGrid) creates a visualization of the fraction of data missing within each horizon. Both functions can optionally filter out horizons that don't typically have data, for example Cr, Cd, and R horizons.
The following examples are based on the gopheridge sample dataset.
evalMissingData
# example data
data("gopheridge", package = "soilDB")
# compute data completeness
gopheridge$data.complete <- evalMissingData(gopheridge, vars = c('clay', 'sand', 'phfield'), name = 'hzname', p = 'Cr|R|Cd')
# check range in missing data
summary(gopheridge$data.complete)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.9735 0.6172 1.0000 1.0000
# rank
new.order <- order(gopheridge$data.complete)
# plot along data completeness ranking
par(mar=c(3,0,1,1))
plot(gopheridge, plot.order=new.order, print.id=FALSE)
# add axis, note re-ordering of axis labels
axis(side=1, at=1:length(gopheridge), labels = round(gopheridge$data.complete[new.order], 2),
line=-2, cex.axis=0.75, las=2)
title('Gopheridge pedons sorted according to data completeness (clay, sand, pH)')
missingDataGrid
# view missing data as a fraction
res <- missingDataGrid(gopheridge, max_depth=100, vars=c('clay', 'sand', 'phfield'), filter.column='hzname', filter.regex = 'Cr|R|Cd', main='Fraction of missing data (clay, sand, pH)')
# check results
head(res)
## peiid clay sand phfield
## 1 242808 29 29 29
## 2 268791 0 0 0
## 3 268792 0 100 0
## 4 268793 0 0 0
## 5 268794 0 0 0
## 6 268795 0 33 0
For now, extract those profiles that have a complete set of field-described clay, sand, or pH values for later use.
# be sure to read the manual page for this function
gopheridge.complete <- subsetProfiles(gopheridge, s = "data.complete == 1")
# looks good
par(mar=c(0,0,3,1))
plot(gopheridge.complete, color='clay', id.style='side', label='pedon_id')
The following three functions are essential to the creation of a distance matrix:
- dist(): This function is in base R, is simple and fast, and has a limited number of distance metrics.
- daisy(): This function is in the cluster package, has a better selection of distance metrics, and includes simple standardization. It is much more convenient than dist().
- vegdist(): This function is in the vegan package, has many distance metrics, and is primarily designed for species composition data.

The following is a short demonstration:
# get some example data from the aqp package
data('sp4', package = 'aqp')
# subset select rows and columns
sp4 <- sp4[1:4, c('name', 'clay', 'sand', 'Mg', 'Ca', 'CEC_7')]
row.names(sp4) <- sp4$name
# compare distance functions
# Euclidean distance, no standardization
round(dist(sp4[, -1], method = 'euclidean'))
## A ABt Bt1
## ABt 8
## Bt1 15 7
## Bt2 48 44 39
# Euclidean distance, standardization
round(daisy(sp4[, -1], stand = TRUE, metric = 'euclidean'), 2)
## Dissimilarities :
## A ABt Bt1
## ABt 1.45
## Bt1 2.73 1.36
## Bt2 6.45 5.65 4.91
##
## Metric : euclidean
## Number of objects : 4
# Gower's generalized distance metric (includes standardization)
round(vegdist(sp4[, -1], method = 'gower'), 2)
## A ABt Bt1
## ABt 0.19
## Bt1 0.32 0.16
## Bt2 0.96 0.84 0.69
The following example is excerpted from "A Novel Display of Categorical Pedon Data". This example illustrates an application of clustering binary data (presence or absence of a diagnostic feature). Internally, the diagnosticPropertyPlot function uses the daisy function to compute pair-wise distances using the "general dissimilarity coefficient" of Gower (Gower 1971). A concise summary of this distance metric is given by Kaufman and Rousseeuw (2005).
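To make the metric concrete, here is a hand-rolled sketch of Gower's distance for continuous characteristics only (the full coefficient also accommodates categorical and binary data): the mean of per-variable absolute differences, each divided by that variable's range. Note that the `gower()` helper below is written for illustration; it is not part of any package. It reproduces the `vegdist(..., method = 'gower')` values shown earlier.

```r
# example data matrix (sp4, first 4 horizons)
x <- data.frame(
  clay = c(21, 27, 32, 55),
  sand = c(46, 42, 40, 27),
  Mg = c(25.7, 23.7, 23.2, 44.3),
  Ca = c(9.0, 5.6, 1.9, 0.3),
  CEC_7 = c(23.0, 21.4, 23.7, 43.0),
  row.names = c('A', 'ABt', 'Bt1', 'Bt2')
)

# Gower's distance (continuous case): mean of range-standardized
# absolute differences; hypothetical helper for illustration
gower <- function(p, q, ranges) mean(abs(p - q) / ranges)

# per-variable ranges
ranges <- sapply(x, function(v) diff(range(v)))

# A vs. ABt: 0.19, matching the vegdist() output above
round(gower(unlist(x['A', ]), unlist(x['ABt', ]), ranges), 2)
```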
# load some example NASIS data
data(loafercreek, package='soilDB')
# cut-down to a subset, first 20 pedons
loafercreek <- loafercreek[1:20, ]
# get depth class
sdc <- getSoilDepthClass(loafercreek)
site(loafercreek) <- sdc
# diagnostic properties to consider, no need to convert to factors
v <- c('lithic.contact', 'paralithic.contact', 'argillic.horizon',
'cambic.horizon', 'ochric.epipedon', 'mollic.epipedon', 'very.shallow',
'shallow', 'mod.deep', 'deep', 'very.deep')
# do the analysis and save the results to object 'x'
x <- diagnosticPropertyPlot(loafercreek, v, k=5, grid.label='bedrock_kind', dend.label = 'taxonname')
If you are wondering what is in the object x, the str function or manual page (?diagnosticPropertyPlot) can help.
The go-to functions for hierarchical clustering are as follows:
- hclust(): This function is agglomerative, is in base R, requires a distance matrix, and implements most of the commonly used linkage criteria.
- agnes(): This function is agglomerative, is in the cluster package, can perform standardization and distance calculations, and implements more linkage criteria.
- diana(): This function is divisive, is in the cluster package, and can perform standardization and distance calculations.

hclust

The hclust() function and resulting hclust-class objects are simple to use, but limited.
# re-make data, this time with all profiles
data('sp4', package = 'aqp')
sp4 <- sp4[, c('name', 'clay', 'sand', 'Mg', 'Ca', 'CEC_7')]
# distance matrix
d <- daisy(sp4[, -1], metric = 'euclidean', stand = TRUE)
# hierarchical clustering with base function hclust
sp4.h <- hclust(d, method = 'ward.D')
sp4.h$labels <- sp4$name
# plot with basic plotting method... not many options here
par(mar=c(2,4,2,2))
plot(sp4.h, font=2, cex=0.85)
# ID clusters after cutting tree
rect.hclust(sp4.h, 4)
ape Package

This example uses a different approach to plotting based on functions and classes from the ape package.
# re-make data, this time with all profiles
data('sp4', package = 'aqp')
sp4 <- sp4[, c('name', 'clay', 'sand', 'Mg', 'Ca', 'CEC_7')]
# distance matrix
d <- daisy(sp4[, -1], metric = 'euclidean', stand = TRUE)
# divisive clustering
dd <- diana(d)
# convert to ape class, via hclust class
h <- as.hclust(dd)
h$labels <- sp4$name
p <- as.phylo(h)
# define some nice colors
col.set <- brewer.pal(9, 'Set1')
# cut tree into 4 groups
groups <- cutree(h, 4)
# make color vector based on groups
cols <- col.set[groups]
The plot methods for phylo class objects are quite flexible. Be sure to see the manual page ?plot.phylo.
par(mar=c(1,1,1,1), mfcol=c(2,2))
plot(p, label.offset=0.125, direction='right', font=1, cex=0.85, main='dendrogram')
tiplabels(pch=15, col=cols)
plot(p, type='radial', font=1, cex=0.85, main='radial')
tiplabels(pch=15, col=cols)
plot(p, type='fan', font=1, cex=0.85, main='fan')
tiplabels(pch=15, col=cols)
plot(p, type='unrooted', font=1, cex=0.85, main='unrooted')
tiplabels(pch=15, col=cols)
The following example is rather basic. Many more possibilities are available in the dendextend package.
# load sample dataset from aqp package
data(sp3)
# promote to SoilProfileCollection
depths(sp3) <- id ~ top + bottom
# compute dissimilarity using the clay and L (color) variables
# note that these are rescaled to the interval [0,1]
d.1 <- profile_compare(sp3, vars=c('clay', 'L'), k=0, max_d=100, rescale.result=TRUE)
# cluster via divisive hierarchical algorithm
# convert to 'phylo' class
p.1 <- as.phylo(as.hclust(diana(d.1)))
# graphically compare diana() to agnes() using d.1
dueling.dendrograms(as.phylo(as.hclust(diana(d.1))),
as.phylo(as.hclust(agnes(d.1, method='average'))), lab.1='divisive', lab.2='agglomerative: average linkage')
The following creates simulated data for demonstration purposes, representing two populations: 1) mean = 0, sd = 0.3, and 2) mean = 1, sd = 0.3.
# nice colors for later
col.set <- brewer.pal(9, 'Set1')
# 2D example
x <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
par(mfrow=c(3,3), mar=c(1,1,1,1))
for(i in 1:9) {
cl <- kmeans(x, centers=3)
plot(x, col = col.set[cl$cluster], axes=FALSE)
grid()
points(cl$centers, col = col.set, pch = 8, cex = 2, lwd=2)
box()
}
k-means function with default settings
par(mfrow=c(3,3), mar=c(1,1,1,1))
for(i in 1:9) {
cl <- kmeans(x, centers=3, nstart = 10, iter.max = 100)
plot(x, col = col.set[cl$cluster], axes=FALSE)
grid()
points(cl$centers, col = col.set, pch = 8, cex = 2, lwd=2)
box()
}
k-means function after increasing the max number of random starts and iterations
TODO: finish this
# pam()
# clara()
par(mfrow=c(2,3), mar=c(1,1,1,1))
for(i in 2:7) {
cl <- pam(x, k = i, stand = TRUE)
plot(x, col = col.set[cl$clustering], axes=FALSE)
grid()
points(cl$medoids, col = col.set, pch = 8, cex = 2, lwd=2)
box()
}
# re-make data, this time with all profiles
data('sp4', package = 'aqp')
sp4.std <- data.frame(sp4[, c('id', 'name', 'top', 'bottom')], scale( sp4[, c('Mg', 'Ca')]))
# perform fuzzy clustering
cl <- fanny(sp4.std[, c('Mg', 'Ca')], k = 3, stand = FALSE)
# get membership matrix
m <- cl$membership
# convert to colors by interpreting membership as R,G,B values
cols <- rgb(m)
# setup plot
par(mar=c(4,4,0,0))
plot(sp4.std$Mg, sp4.std$Ca, asp=1, ylab='Exchangeable Mg (cmol/kg), Standardized', xlab='Exchangeable Ca (cmol/kg), Standardized', type='n')
abline(h=0, v=0, col='black')
grid()
# add original obs
points(sp4.std$Mg, sp4.std$Ca, bg=cols, col='black', cex=1.25, pch=21)
sp4.std$colors <- cols
depths(sp4.std) <- id ~ top + bottom
par(mar=c(0,0,0,0))
plot(sp4.std, color='colors', cex.names=0.75)
title('Fuzzy Clustering Results in Context', line=-3)
There is no simple answer to the question "How many clusters are in my data?" Some metrics, however, can be used to help estimate a "reasonable" number of clusters. The mean silhouette width is a useful index of "cluster compactness" relative to neighbor clusters (P. Rousseeuw 1987). Larger silhouette widths suggest tighter grouping.
# re-make data, this time with all profiles
data('sp4', package = 'aqp')
sp4.std <- data.frame(sp4[, c('id', 'name', 'top', 'bottom')], scale( sp4[, c('Mg', 'Ca')]))
# perform hard clustering
sil.widths <- vector(mode='numeric')
for(i in 2:10) {
cl <- pam(sp4.std[, c('Mg', 'Ca')], k = i, stand = FALSE)
sil.widths[i] <- cl$silinfo$avg.width
}
par(mar=c(4,4,3,1))
plot(sil.widths, type='b', xlab='Number of Clusters', ylab='Average Silhouette Width', las=1, lwd=2, col='RoyalBlue', cex=1.25, main='Finding the "Right" Number of Clusters')
grid()
According to this metric, it looks like 3 clusters is reasonable. Again, this is a judgment call: most decisions related to clustering algorithm selection and the "optimal number of clusters" are somewhat subjective.
# perform fuzzy clustering
cl <- pam(sp4.std[, c('Mg', 'Ca')], k = 3, stand = FALSE)
# setup plot
par(mar=c(4,4,0,0))
plot(sp4.std$Mg, sp4.std$Ca, asp=1, ylab='Exchangeable Mg (cmol/kg), Standardized', xlab='Exchangeable Ca (cmol/kg), Standardized', type='n')
abline(h=0, v=0, col='black')
grid()
# add original obs
points(sp4.std$Mg, sp4.std$Ca, bg=cl$clustering, col='black', cex=1.25, pch=21)
TODO: add limitations here
# re-make data, this time with all profiles
data('sp4', package = 'aqp')
sp4 <- sp4[, c('name', 'clay', 'sand', 'Mg', 'Ca', 'CEC_7')]
sp4.scaled <- data.frame(name=sp4[, 1], round(scale( sp4[, -1]), 2))
# distance matrix
d <- dist(sp4.scaled[, -1])
m <- as.matrix(d)
dimnames(m) <- list(sp4.scaled$name, sp4.scaled$name)
d <- as.dist(m)
# dendrogram from divisive clustering
dd <- diana(d)
h <- as.hclust(dd)
p <- as.phylo(h)
# define colors based on natural groupings
cols <- brewer.pal(9, 'Set1')[cutree(h, 4)]
# MDS
s <- sammon(d)
plot(s$points, asp=1, type='n', axes=FALSE, xlab='', ylab='')
# abline(v=0, h=0, col='black')
grid()
text(s$points, rownames(s$points), cex=0.75, col=cols, font=2)
box()
mtext('Ordination of Distance Matrix\n(all characteristics, standardized)', side=1, line=1)
vegan Package

TODO: add details here
# re-make data, this time with all profiles
data('sp4', package = 'aqp')
sp4 <- sp4[, c('name', 'clay', 'sand', 'Mg', 'Ca', 'CEC_7')]
sp4.scaled <- data.frame(name=sp4[, 1], round(scale( sp4[, -1]), 2))
# define colors based on natural groupings
cols <- brewer.pal(9, 'Set1')
# distance calc + ordination
s <- metaMDS(sp4.scaled[, -1], distance = 'gower', autotransform = FALSE, wascores=FALSE)
# plot ordination
fig <- ordiplot(s, type='none', cex.axis=0.75, axes=FALSE, xlab='', ylab='')
abline(h=0, v=0, lty=2, col='grey')
text(fig$sites, sp4$name, cex=0.65, font=2)
# ordicluster(fig, agnes(daisy(sp4.scaled[, -1]), method='ward'), prune=3, col = "orange")
box()
mtext('Ordination of Distance Matrix\n(all characteristics, standardized)', side=1, line=1)
Before you work through the following examples, you should review the SoilProfileCollection object tutorial.
# init example data
data(sp4)
depths(sp4) <- id ~ top + bottom
# eval dissimilarity:
# using Ex-Ca:Mg and CEC at pH 7
# with no depth-weighting (k=0)
# to a maximum depth of 40 cm
d <- profile_compare(sp4, vars=c('ex_Ca_to_Mg', 'CEC_7'), k=0, max_d=40)
# check distance matrix:
round(d, 1)
## Dissimilarities :
## colusa glenn kings mariposa mendocino napa san benito shasta shasta-trinity
## glenn 13.5
## kings 16.0 12.7
## mariposa 8.4 11.3 16.5
## mendocino 11.5 8.0 16.4 15.0
## napa 30.4 24.1 29.4 29.2 21.6
## san benito 25.7 20.6 26.3 28.2 15.8 18.0
## shasta 17.2 13.3 8.7 17.6 17.1 33.7 22.2
## shasta-trinity 6.4 16.6 22.3 9.6 16.5 29.8 27.2 23.3
## tehama 28.7 22.9 27.9 27.3 20.0 8.8 15.1 31.4 27.9
##
## Metric : mixed ; Types = I, I
## Number of objects : 10
# cluster via divisive method
clust <- diana(d)
# visualize dissimilarity matrix via hierarchical clustering
par(mar=c(0,0,3,0))
plotProfileDendrogram(sp4, clust, dend.y.scale = max(d), scaling.factor = (1/max(d) * 10), y.offset = 2, width=0.15, cex.names=0.45, color='ex_Ca_to_Mg', col.label='Exchangeable Ca to Mg Ratio')
The following are demonstrations of pair-wise distances computed from categorical data and the use of a dendrogram to organize groups from Soil Taxonomy. Click here for details.
# define a vector of series
s.list <- c('amador', 'redding', 'pentz', 'willows', 'pardee', 'yolo', 'hanford', 'cecil', 'sycamore', 'KLAMATH', 'MOGLIA', 'drummer', 'musick', 'zook', 'argonaut', 'PALAU')
# get an SPC object with basic data on these series
s <- fetchOSD(s.list)
# graphical check
par(mar=c(0,0,2,0))
plot(s) ; title('Selected Pedons from Official Series Descriptions', line=0)
# check structure of some site-level attributes
head(site(s))[, c('id', 'soilorder', 'suborder', 'greatgroup', 'subgroup')]
## id soilorder suborder greatgroup subgroup
## 1 AMADOR inceptisols xerepts haploxerepts typic haploxerepts
## 2 ARGONAUT alfisols xeralfs haploxeralfs mollic haploxeralfs
## 3 CECIL ultisols udults kanhapludults typic kanhapludults
## 4 DRUMMER mollisols aquolls endoaquolls typic endoaquolls
## 5 HANFORD entisols orthents xerorthents typic xerorthents
## 6 KLAMATH mollisols aquolls cryaquolls cumulic cryaquolls
par(mar=c(0,1,1,1))
# plot dendrogram + profiles
d <- SoilTaxonomyDendrogram(s, scaling.factor = 0.01)
Check the resulting distance matrix.
print(d)
Just for fun, use hierarchical clustering and nMDS on soil color data from the OSDs that were used in the previous example.
# extract horizon data from select OSDs in above example
h <- horizons(s)
# convert Munsell color notation to sRGB
# these are moist colors
rgb.data <- munsell2rgb(h$hue, h$value, h$chroma, return_triplets = TRUE)
# check
head(rgb.data)
## r g b
## 1 0.4360624 0.3706674 0.29697452
## 2 0.5589675 0.4673350 0.35663875
## 3 0.5589675 0.4673350 0.35663875
## 4 0.7719679 0.6774631 0.48997537
## 5 0.3940324 0.2499977 0.16682669
## 6 0.4309729 0.2327690 0.09771028
# remove NA
rgb.data <- na.omit(rgb.data)
# retain unique colors
rgb.data <- unique(rgb.data)
# convert sRGB colors to CIE LAB color system
lab.data <- as(with(rgb.data, sRGB(r, g, b)), 'LAB')
# visualize colors in LAB coordinates
pairs(lab.data@coords, col='white', bg=rgb(rgb.data), pch=21, cex=2)
# create distance matrix from LAB coordinates
d <- daisy(lab.data@coords, stand = TRUE)
# divisive hierarchical clustering
d.hclust <- as.hclust(diana(d))
# convert to phylo class for nicer plotting
p <- as.phylo(d.hclust)
# perform nMDS on distance matrix
d.sammon <- sammon(d)
# setup multi-figure page
par(mfcol=c(1,2), mar=c(0,0,2,0), bg=grey(0.95))
# plot fan-style dendrogram
plot(p, font=2, cex=0.5, type='fan', show.tip.label=FALSE, main='Dendrogram Representation')
# add colors at dendrogram tips
tiplabels(pch=21, cex=4, col='white', bg=rgb(rgb.data))
# plot nMDS ordination
plot(d.sammon$points, type='n', axes=FALSE, xlab='', ylab='', asp=1, main='nMDS Ordination')
abline(h=0, v=0, col='black', lty=3)
points(d.sammon$points, bg=rgb(rgb.data), pch=21, cex=3.5, col='white')
Example borrowed from this tutorial.
library(reshape2)
# set list of component names, same as soil color example
s.list <- c('amador', 'redding', 'pentz', 'willows', 'pardee', 'yolo',
'hanford', 'cecil', 'sycamore', 'KLAMATH', 'MOGLIA', 'drummer',
'musick', 'zook', 'argonaut', 'PALAU')
# set list of relevant interpretations
interp.list <- c('ENG - Construction Materials; Topsoil',
'ENG - Sewage Lagoons', 'ENG - Septic Tank Absorption Fields',
'ENG - Unpaved Local Roads and Streets')
# compose query
q <- paste0("SELECT UPPER(compname) as compname, mrulename, AVG(interplr) as interplr_mean
FROM component INNER JOIN cointerp ON component.cokey = cointerp.cokey
WHERE compname IN ", format_SQL_in_statement(s.list), "
AND seqnum = 0
AND mrulename IN ", format_SQL_in_statement(interp.list), "
AND interplr IS NOT NULL
GROUP BY compname, mrulename;")
# send query
x <- SDA_query(q)
# reshape long -> wide
x.wide <- dcast(x, compname ~ mrulename, value.var = 'interplr_mean')
knitr::kable(x.wide, digits = 3, caption="Mean Fuzzy Ratings for Select Soil Series")
| compname | ENG - Construction Materials; Topsoil | ENG - Septic Tank Absorption Fields | ENG - Sewage Lagoons | ENG - Unpaved Local Roads and Streets |
|---|---|---|---|---|
| AMADOR | 0.000 | 1.000 | 1.000 | 0.752 |
| ARGONAUT | 0.050 | 1.000 | 1.000 | 1.000 |
| CECIL | 0.417 | 0.662 | 0.849 | 0.294 |
| DRUMMER | 0.000 | 1.000 | 1.000 | 1.000 |
| HANFORD | 0.678 | 0.989 | 1.000 | 0.220 |
| KLAMATH | 0.000 | 1.000 | 1.000 | 1.000 |
| MOGLIA | 0.000 | 1.000 | 0.400 | 1.000 |
| MUSICK | 0.167 | 1.000 | 1.000 | 0.909 |
| PALAU | 0.011 | 1.000 | 0.864 | 1.000 |
| PARDEE | 0.000 | 1.000 | 1.000 | 1.000 |
| PENTZ | 0.004 | 1.000 | 1.000 | 0.739 |
| REDDING | 0.033 | 1.000 | 1.000 | 0.862 |
| SYCAMORE | 0.787 | 0.947 | 0.759 | 0.861 |
| WILLOWS | 0.010 | 1.000 | 0.894 | 1.000 |
| YOLO | 0.843 | 0.914 | 0.608 | 0.769 |
| ZOOK | 0.006 | 1.000 | 1.000 | 1.000 |
# note: component name and series name have been converted to upper case
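The long-to-wide reshape used above can be sketched in isolation with a couple of invented rows (the component names, rule names, and ratings below are hypothetical):

```r
library(reshape2)
# hypothetical long-format ratings: one row per component / rule pair
long <- data.frame(
  compname = c('A', 'A', 'B', 'B'),
  mrulename = c('septic', 'lagoons', 'septic', 'lagoons'),
  interplr_mean = c(1.0, 0.8, 0.66, 0.4)
)
# cast to wide format: one row per component, one column per rule
wide <- dcast(long, compname ~ mrulename, value.var = 'interplr_mean')
wide
```

Each unique value of `mrulename` becomes a column, so the result is a conventional data matrix, ready for a distance calculation.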
# sort rows of fuzzy ratings based on profiles from OSDs
new.order <- match(x.wide$compname, profile_id(s))
x.wide <- x.wide[new.order, ]
# copy ids to row.names so that they are preserved in distance matrix
row.names(x.wide) <- x.wide$compname
# create distance matrix
d <- daisy(x.wide[, -1])
# divisive hierarchical clustering
clust <- diana(d)
par(mar=c(2,0,2,0))
plotProfileDendrogram(s, clust, dend.y.scale = 1.5, scaling.factor = 0.004, y.offset = 0.1, width=0.15, cex.names=0.45)
title('Component Similarity via Select Fuzzy Ratings')
mtext('Profile Sketches are from OSDs', 1)
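The distance-then-cluster pattern used above can be reduced to a minimal, stand-alone sketch (the three-component rating matrix below is invented for illustration):

```r
library(cluster)
# hypothetical fuzzy ratings for three components
ratings <- data.frame(
  row.names = c('ALPHA', 'BETA', 'GAMMA'),
  septic = c(1.0, 0.95, 0.3),
  lagoons = c(1.0, 0.90, 0.2)
)
# pair-wise dissimilarity; all-numeric input -> Euclidean distance by default
d <- daisy(ratings)
# divisive hierarchical clustering
clust <- diana(d)
# ALPHA and BETA group together because their ratings are nearly identical
clust$order.lab
```

Row names are carried through `daisy()` into the distance matrix, which is why the example above copies component names into `row.names` before clustering.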
This example generates an ordination of environmental variables (PRISM climate data and elevation) associated with MLRAs 15, 18, 22A, and 22B. The contours enclose 75% (dotted line), 50% (dashed line), and 25% (solid line) of the points. See the comments in the code for details. Note that this example was extracted from the Region 2 Map Unit Comparison Report.
library(MASS)
library(vegan)
library(cluster)
library(RColorBrewer)
# get example data
# init a temp file
tf <- tempfile()
# download compressed CSV to temp file
download.file('https://github.com/ncss-tech/stats_for_soil_survey/raw/master/data/clustering_and_ordination/MLRA-raster-data-example.csv.gz', destfile = tf, quiet = TRUE)
# read-in from compressed CSV to data.frame object
d.sub <- read.csv(gzfile(tf), stringsAsFactors = FALSE)
# check: note that the column names are quite strange
head(d.sub)
# set factor levels
mu.set <- c('15', '18', '22A', '22B')
d.sub$.id <- factor(d.sub$.id, levels = mu.set)
# define some nice colors
cols <- brewer.pal(9, 'Set1')
# remove light colors
cols <- cols[c(1:5,7,9)]
# http://stackoverflow.com/questions/16225530/contours-of-percentiles-on-level-plot
kdeContours <- function(i, prob, cols, m, ...) {
  # need at least 2 rows of data
  if(nrow(i) < 2) {
    return(NULL)
  }
  # keep track of colors
  this.id <- unique(i$.id)
  this.col <- cols[match(this.id, m)]
  # estimate density
  dens <- kde2d(i$x, i$y, n=200)
  # differentiate to estimate quantiles
  dx <- diff(dens$x[1:2])
  dy <- diff(dens$y[1:2])
  sz <- sort(dens$z)
  c1 <- cumsum(sz) * dx * dy
  levels <- sapply(prob, function(x) {
    approx(c1, sz, xout = 1 - x)$y
  })
  # add contours if possible
  if(all(!is.na(levels)))
    contour(dens, levels=levels, drawlabels=FALSE, add=TRUE, col=this.col, ...)
}
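The quantile trick inside `kdeContours()` deserves a note: sorting the density grid and accumulating `z * dx * dy` gives the total probability mass lying below each density value, so `approx()` can invert that relationship and return the density level whose contour encloses a requested fraction of the points. A stand-alone sketch with simulated data:

```r
library(MASS)
set.seed(1)
x <- rnorm(500)
y <- rnorm(500)
# 2D kernel density estimate on a regular grid
dens <- kde2d(x, y, n = 100)
# grid cell area
dx <- diff(dens$x[1:2])
dy <- diff(dens$y[1:2])
# cumulative probability mass below each sorted density value
sz <- sort(dens$z)
c1 <- cumsum(sz) * dx * dy
# density level whose contour encloses ~50% of the mass
lev.50 <- approx(c1, sz, xout = 1 - 0.5)$y
# a larger probability requires a lower density threshold (a larger contour)
lev.75 <- approx(c1, sz, xout = 1 - 0.75)$y
```

Passing `lev.50` to `contour(dens, levels = lev.50, ...)` draws the curve containing roughly half of the sample points.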
## NOTE: data with very low variability will cause warnings
# eval numerical distance, removing first 3 columns of IDs
d.dist <- daisy(d.sub[, -c(1:3)], stand=TRUE)
## map distance matrix to 2D space via principal coordinates
d.betadisper <- betadisper(d.dist, group=d.sub$.id, bias.adjust = TRUE, sqrt.dist = TRUE, type='median')
d.scores <- scores(d.betadisper)
# split data into a list by ID
s <- data.frame(x=d.scores$sites[, 1], y=d.scores$sites[, 2], .id=d.sub$.id)
s <- split(s, s$.id)
# plot
par(mar=c(1,1,3,1))
plot(d.scores$sites, type='n', axes=FALSE)
abline(h=0, v=0, lty=2, col='grey')
# add contours of prob density
# NOTE: lines are not added if data are too densely spaced for evaluation of requested prob. level
res <- lapply(s, kdeContours, prob=c(0.75), cols=cols, m=levels(d.sub$.id), lwd=1, lty=3)
res <- lapply(s, kdeContours, prob=c(0.5), cols=cols, m=levels(d.sub$.id), lwd=1, lty=2)
res <- lapply(s, kdeContours, prob=c(0.25), cols=cols, m=levels(d.sub$.id), lwd=2, lty=1)
# add sample points
points(d.scores$sites, cex=0.45, col=cols[as.numeric(d.sub$.id)], pch=16)
# note special indexing for cases when low-var MU have been removed
ordilabel(d.betadisper, display='centroids', col=cols[match(mu.set, levels(d.sub$.id))])
title('Ordination of Raster Samples (cLHS Subset) with 25%, 50%, 75% Density Contours')
box()
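`betadisper()` maps the distance matrix into a low-dimensional space via principal coordinates analysis; the core operation is the same one performed by base R's `cmdscale()`. A minimal sketch using the built-in `iris` data:

```r
# distance matrix on standardized measurements for 10 samples
d <- dist(scale(iris[1:10, 1:4]))
# principal coordinates: embed the 10 x 10 distance matrix in 2D
mds <- cmdscale(d, k = 2)
# 10 rows (samples) x 2 columns (ordination axes)
dim(mds)
```

The two columns of `mds` play the same role as the site scores plotted above: coordinates in a synthetic 2D space that preserves the pair-wise distances as closely as possible.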
Thoughts on interpretation:
The following is an investigation of topographic "signatures" from the initial soil mapping project of Sequoia and Kings Canyon National Parks, CA.
library(plyr)
library(reshape2)
library(vegan)
library(cluster)
# init a temp file
tf <- tempfile()
# download compressed CSV to temp file
download.file('https://github.com/ncss-tech/stats_for_soil_survey/raw/master/data/clustering_and_ordination/seki-mu-gis-samples.csv.gz', destfile = tf, quiet = TRUE)
# read-in from compressed CSV to data.frame object
x <- read.csv(gzfile(tf), stringsAsFactors = FALSE)
# check
head(x)
# get terrain variable names
vars <- c('elev', 'solar', 'slope', 'tci', 'ppt', 'maat', 'tpi', 'tri', 'pi')
# convert wide to long format for summary by MU symbol
x.long <- melt(x, id.vars = 'MUSYM', measure.vars = vars)
x.summary <- ddply(x.long, c('MUSYM', 'variable'), .fun=plyr::summarize, m=median(value, na.rm = TRUE))
head(x.summary)
# retain median values, convert back to wide format
x.wide <- dcast(x.summary, MUSYM ~ variable, value.var = 'm')
row.names(x.wide) <- x.wide$MUSYM
head(x.wide, 2)
# generate lookup table of MUSYM - MU groups
lut <- unique(x[, c('MUSYM', 'gensym')])
# join data matrix with lut
x.wide <- join(x.wide, lut, by='MUSYM', type='left', match='first')
x.wide$gensym <- factor(x.wide$gensym, levels = c("riparian", "xx10", "xx20", "xx30", "xx40", "xx50"), labels = c("riparian", "basins and cirques", "glacial valleys", "bedrock controlled mountain slopes and valley walls", "non bedrock controlled slopes", "plateaus, till plains, outwash plains"))
# perform non-metric MDS
nmds <- metaMDS(x.wide[, vars], distance = 'gower', autotransform = FALSE, wascores=FALSE)
# setup colors
cols <- c('RoyalBlue', 'black', 'DarkGreen', 'orange3', 'red', 'brown')
site.colors <- cols[as.numeric(x.wide$gensym)]
# plot ordination
par(mar=c(1,1,1,1))
fig <- ordiplot(nmds, type='none', cex.axis=0.75, axes=FALSE)
abline(h=0, v=0, lty=2, col='grey')
text(fig, "sites", cex=0.65, font=2, col=site.colors)
legend('bottomleft', legend=levels(x.wide$gensym), col=cols, pch=15, pt.cex=2, bty='n', cex=1)
title('Map Unit Terrain Signatures')
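When interpreting an NMDS ordination it is worth checking the stress value, which quantifies how faithfully the low-dimensional configuration preserves the rank-order of the original distances (lower is better; values under about 0.2 are commonly considered usable). A quick sketch using vegan's built-in `dune` vegetation data:

```r
library(vegan)
# built-in dune meadow vegetation data
data(dune)
# non-metric MDS on Bray-Curtis distances
nmds <- metaMDS(dune, distance = 'bray', autotransform = FALSE, trace = 0)
# stress: goodness of fit of the 2D configuration
nmds$stress
```

The same check applies to the terrain-signature ordination above: a high stress value would suggest that two axes are not enough to summarize the distances among map units.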
This document is based on aqp version 1.10-1, soilDB version 1.8.1, and sharpshootR version 1.2-1.
Arkley, Rodney J. 1976. “Statistical Methods in Soil Classification Research.” In Advances in Agronomy, edited by N. C. Brady, 37–69. New York, NY: Academic Press.
Gower, J. C. 1971. “A General Coefficient of Similarity and Some of Its Properties.” Biometrics 27 (4). International Biometric Society: 857–71. http://www.jstor.org/stable/2528823.
Hole, F. D., and M. Hironaka. 1960. “An Experiment in Ordination of Some Soil Profiles.” Soil Science Society of America Proceedings 24: 309–12.
Kaufman, Leonard, and Peter J. Rousseeuw. 2005. Finding Groups in Data: An Introduction to Cluster Analysis. Hoboken, NJ: Wiley-Interscience.
Legendre, P., and L. Legendre. 1998. Numerical Ecology. 2nd ed. Developments in Environmental Modeling 20. Amsterdam: Elsevier.
Rousseeuw, P.J. 1987. “Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis.” Journal of Computational and Applied Mathematics 20: 53–65.
Sneath, Peter H. A., and Robert R. Sokal. 1973. Numerical Taxonomy. San Francisco: W.H. Freeman and Company.